
We describe algorithmic results on two crucial aspects of allocating resources on computational hardware
devices with partial reconfigurability. By using methods from the field of computational geometry,
we derive a method that allows correct maintenance of free and occupied space of a set of $n$ rectangular
modules in time $O(n \log n)$; previous approaches needed a time of $O(n^2)$ for correct results and $O(n)$
for heuristic results. We also show a matching lower bound of $\Omega(n \log n)$, so our approach is optimal.
We also show that an optimal feasible communication-conscious placement (which minimizes the
total weighted Manhattan distance between the new module and existing demand points) can be
computed in time $O(n \log n)$. Both resulting algorithms are practically easy to implement and show convincing
experimental behavior.

ACM Classification: C.3.e: Reconfigurable Hardware;
F.2.2.C: Geometrical problems and computations

Keywords: Reconfigurable hardware, field-programmable gate array (FPGA), module placement, free
space manager, routing-conscious placement, geometric optimization, line sweep technique, optimal running
time, lower bounds.

1 Introduction

Reconfigurable Computing. One of the cutting-edge aspects of modern reconfigurable computing is
the possibility of partial reconfiguration of a device: A new module can be placed on a reconfigurable chip
without interfering with the computation of other modules. Clearly, this approach has advantages over a full
reconfiguration of the whole chip. However, there is still a tremendous need for scientific progress: The
technical possibilities for partial reconfiguration have been somewhat restricted, and manufacturers have
been slow in providing possibilities, tools, and documentation. As a consequence, there has only been a
limited amount of previous research on this topic.

New reconfigurable devices such as FPGAs offer increasing levels of partial reconfigurability, and chip 
sizes continue to grow. At the same time, static programming methodologies show an increasing use of 
pre-implementation by means of relocatable module libraries with bounding-box restrictions. 

These developments place an ever-growing demand on the run-time management of resource allocation. 
As these tasks become more and more complex, one needs support in the form of operating systems [2] for 
managing both software and hardware processes (see Figure 1).

*A short abstract summarizing the results of this paper appeared in the Proceedings of the 14th International Conference
on Field-Programmable Logic and Application (FPL '04) [1].

Figure 1: An operating system for reconfigurable computers.



Runtime space allocation, also known as temporal placement, is a central part in partially reconfigurable 
computing systems. In this paper, we present methods to solve two crucial issues for devices that allow for 
partial reconfiguration: 

1. Given a set of n rectangular modules that have been placed on a chip, identify all feasible positions for 
a new module. 

2. Given a set of n rectangular modules that have been placed on a chip, a new module, and demands for 
connecting it to existing sites, find a feasible position for the module that minimizes the total weighted 
distance to the given sites. 

Related Work. The first of the above issues is the task of maintaining free space. Bazargan et al. [3]
describe how to achieve this by maintaining the set of all maximal free rectangles; as this set can have
size $\Omega(n^2)$, the complexity is quadratic. Alternatively, they propose partitioning free space into only $O(n)$
free rectangles; the price for this improved complexity is the fact that no feasible placement may be found,
even though one exists. Walder et al. [4] have suggested ways to reduce this deficiency and did report on
experimental improvement, but their $O(n)$ procedure is still a heuristic approach that may fail in some
scenarios. Thus, there remains a gap between $O(n^2)$ methods that report an accurate answer, and $O(n)$
heuristics that may fail in some scenarios. Ahmadinia et al. [5] suggested maintaining occupied space instead
of free space, but (depending on the computational model) their approach is still quadratic. Other current
work on free-space management was presented by Tabero et al. [6,7], who provide an $O(n^2)$ approach based
on keeping track of possible corner positions.

The above difficulty may have contributed to the fact that routing-conscious placement has received 
hardly any attention at all: Clearly, optimal placement of a new module has to go beyond feasible placement. 
In the context of configurable computing, this second aspect has only been treated very recently, in work by 
Ahmadinia et al. [5], who suggest a heuristic to find a feasible placement for a new module that has small 
total weighted Euclidean distance to a given set of demand points. In the area of discrete algorithms, two 
papers study a somewhat related problem: Karp et al. [8] consider the problem of arranging a set of records 
in a 2-dimensional array, such that the total weighted distance is minimized. In this context, all records have 
the same size (unit cells), and no previous records have been placed; on the other hand, all records have to 
be placed at once, which is different from our scenario. It should be noted that for the case of Manhattan 
distances, the resulting shape for large numbers of records can only be described by a differential equation, 
indicating surprising computational difficulties. In more recent work, Bender et al. [9] consider the problem
of allocating $k$ processors in a grid supercomputer in the presence of occupied cells; this is a generalization
of [8]. They also present empirical evidence that indeed the Manhattan distance between processors should
be minimized for optimizing communication cost, and thus runtime of the resulting jobs. In this context,
see also [10-12].

Our Results. We resolve both of the above issues:

• We give an $O(n \log n)$ method to provide a free-space manager (FSM). This method uses a plane-sweep
approach from computational geometry.

• We give a matching lower bound of $\Omega(n \log n)$ for locating a maximal free rectangle between a set of $n$
modules, showing that our method has optimal complexity.

• We show that our FSM can be extended to find a feasible position that minimizes total weighted
Manhattan distance to existing sites. The resulting algorithm still has an optimal run time of $O(n \log n)$.

• We describe implementation details to illustrate that our method is fast and easy. 

• We provide experimental data to demonstrate the practical usefulness of our results. 

The rest of this paper is organized as follows. In Section 2 we present our optimal FSM. Section 3
describes how to perform optimal routing-conscious placement. Section 4 shows implementation details and
experimental data. The final Section 5 discusses possible implications and extensions of our work.

2 The Free-Space Manager 

In this section we present our approach to free-space management. Our FSM is based on the observation
that the occupied space consists of very simple geometric objects, namely $n$ placed rectangular modules. Put
simply, our free-space manager is a modification of the well-known algorithm ContourOfUnionOfRectangles
(CUR) [13-15] for finding the contour of a union of axis-parallel rectangles. As the number of
contour segments is linear in $n$, we achieve a running time of $O(n \log n)$. Note that we do not require the
contour to be connected, i.e., our approach works even if there are holes in the arrangement.

2.1 Free-Space Manager Basics 
Figure 2: A set of existing modules (white) on a rectangular chip and an additional module that has to be
placed on the remaining free chip area (dark). 

We consider an FPGA or other reconfigurable device and denote its width by $W$ and its height by $H$.
Assuming a coordinate origin in the lower left corner of the chip, we can describe the corresponding input
by the quadruple $F = (0, 0, W, H)$. On the device, a set $M = \{(x_i, y_i, w_i, h_i) : i \in \{1, 2, \ldots, n\}\}$ of modules
$m_i$ with widths $w_i$ and heights $h_i$ has been placed, with lower left corners at positions $(x_i, y_i)$. The task for
the free-space manager is to identify regions where a new module $m$ with width $w_m$ and height $h_m$ can be
placed.
Figure 3: Expanding existing modules and shrinking chip area and the new module reduces free-space
management to managing free space for a single point.

Figure 4: Resulting free space (dark).



As mentioned above, prior free-space managers maintained lists of unoccupied rectangles. Because the
number of maximal empty rectangles is quadratic in the number of modules, this clearly leads to quadratic
running times. In [5], Ahmadinia et al. described how the management of free space can be simplified to
finding a placement for a single point (see Figure 3) by transforming the problem data as follows: Shrink
the area of the chip and simultaneously blow up the existing modules by half the width and half the height
of the new module. This way the chip area is given by

$$F' = \left(\frac{w_m}{2}, \frac{h_m}{2}, W - \frac{w_m}{2}, H - \frac{h_m}{2}\right),$$

and the set of placed modules $M$ clipped by $F'$ becomes

$$M' = \{(x'_i, y'_i, w'_i, h'_i) : i \in \{1, 2, \ldots, n\}\}$$

with

$$x'_i = \max\left\{x_i - \frac{w_m}{2}, \frac{w_m}{2}\right\},$$

$$y'_i = \max\left\{y_i - \frac{h_m}{2}, \frac{h_m}{2}\right\},$$

$$w'_i = \min\{w_i + w_m, W - w_m\},$$

$$h'_i = \min\{h_i + h_m, H - h_m\}.$$

The new module m reduces to a point m' and the original problem of finding free space for a rectangle 
reduces to finding free space for that point. 
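
The transformation can be illustrated by a minimal Python sketch. The names `Rect`, `shrink_chip`, `expand_module`, `clip_to_chip`, and `center_is_feasible` are hypothetical and not taken from the paper. As an assumption of this sketch, expanded rectangles are clipped corner by corner, and the feasibility test uses the unclipped expansion so that it stays exact at the chip border; it is a brute-force $O(n)$ check per point, not the FSM itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    x: float  # lower left corner
    y: float
    w: float  # width
    h: float  # height


def shrink_chip(W, H, wm, hm):
    """Region F' of admissible centers, as (x_min, y_min, x_max, y_max)."""
    return (wm / 2, hm / 2, W - wm / 2, H - hm / 2)


def expand_module(mod, wm, hm):
    """Blow up a placed module by half the new module's width and height."""
    return Rect(mod.x - wm / 2, mod.y - hm / 2, mod.w + wm, mod.h + hm)


def clip_to_chip(rect, chip):
    """Clip an expanded module to F' (corner-by-corner clipping is a choice of this sketch)."""
    x_min, y_min, x_max, y_max = chip
    x0, y0 = max(rect.x, x_min), max(rect.y, y_min)
    x1, y1 = min(rect.x + rect.w, x_max), min(rect.y + rect.h, y_max)
    return Rect(x0, y0, x1 - x0, y1 - y0)


def center_is_feasible(cx, cy, modules, W, H, wm, hm):
    """True if a new module centered at (cx, cy) fits on the chip without overlap."""
    x_min, y_min, x_max, y_max = shrink_chip(W, H, wm, hm)
    if not (x_min <= cx <= x_max and y_min <= cy <= y_max):
        return False
    for mod in modules:
        e = expand_module(mod, wm, hm)
        # Only the open interior is forbidden: points on the border are feasible.
        if e.x < cx < e.x + e.w and e.y < cy < e.y + e.h:
            return False
    return True
```

For example, on a 10 x 10 chip with one 2 x 2 module at (4, 4) and a new 2 x 2 module, the center (3, 5) lies on the border of the expanded module and is feasible, while (5, 5) is not.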

For the module to be placed shown in Figure 2, the result of the transformation is shown in Figure 4. All
points within and on the border of the shaded areas are feasible placement locations.

2.2 Representation of Free Space 

Among all feasible placements, all points on the contour of the free space are feasible. As we will see in
Section 3, these positions are of particular interest when trying to preserve a good structure of free
space, and minimizing total communication distance.

In general, finding the contour of a set of $n$ axis-aligned rectangles can be done in time $O(n \log n + s)$ by
using the CUR algorithm as described in [13-15]. Here $s$ is the complexity of the resulting contour. Our
algorithm is not simply an implementation of CUR. There are a few subtleties that have to be considered.
All differences stem from the above-mentioned fact that the points on the contour are feasible placements.
As a consequence, our algorithm has to find free space of height 0 and width 0 (see Figure 5). In the following
we will describe CUR and our modifications to it.

The building blocks of CUR are an algorithmic technique from computational geometry called plane 
sweep and a data structure called segment tree. For an in-depth introduction to both see [16]. 

A plane-sweep algorithm is an algorithm that scans the plane and a set of objects in it: Move an axis-parallel
line in an orthogonal direction across the plane and keep track of the structure of the intersection
with the set of objects. The key observation is that updates to this structure only occur at a discrete
set of critical positions called events. By pre-sorting these events (in time $O(n \log n)$), only the updates have
to be performed, which can be done efficiently for all events by using an appropriate data structure. For our
purposes, such a data structure is a segment tree: This is a balanced binary tree for dynamically storing a
set of $n$ intervals. The number of endpoints of these intervals must be known at construction time. Because
it is bounded by $2n$, the segment tree can be constructed in time $O(n)$. Insertion and deletion of intervals can be
done in time $O(\log n)$. Segment trees as used in [15,17] have been introduced in [18]. See [19] for more details.
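
To make the data structure concrete, here is a minimal Python sketch of a segment tree over a fixed set of endpoints that supports interval insertion and deletion in $O(\log n)$. It is a simplified variant that only maintains cover counts and the total covered length; the tree used by CUR additionally reports contour pieces, which is omitted here. The class and method names are hypothetical, and the sketch assumes at least two distinct endpoints.

```python
from bisect import bisect_left


class SegmentTree:
    """Cover-count segment tree over a fixed, known set of endpoints."""

    def __init__(self, endpoints):
        self.xs = sorted(set(endpoints))       # endpoints known at construction time
        self.m = len(self.xs) - 1              # number of elementary intervals
        self.count = [0] * (4 * self.m)        # intervals covering the whole node
        self.covered = [0] * (4 * self.m)      # covered length below the node

    def insert(self, lo, hi):
        self._update(1, 0, self.m, bisect_left(self.xs, lo), bisect_left(self.xs, hi), +1)

    def delete(self, lo, hi):
        self._update(1, 0, self.m, bisect_left(self.xs, lo), bisect_left(self.xs, hi), -1)

    def covered_length(self):
        return self.covered[1]

    def _update(self, node, lo, hi, a, b, delta):
        # The node represents elementary intervals with indices in [lo, hi).
        if b <= lo or hi <= a:
            return
        if a <= lo and hi <= b:
            self.count[node] += delta
        else:
            mid = (lo + hi) // 2
            self._update(2 * node, lo, mid, a, b, delta)
            self._update(2 * node + 1, mid, hi, a, b, delta)
        if self.count[node] > 0:
            self.covered[node] = self.xs[hi] - self.xs[lo]
        elif hi - lo == 1:
            self.covered[node] = 0
        else:
            self.covered[node] = self.covered[2 * node] + self.covered[2 * node + 1]
```

During a horizontal sweep, `insert` would be called for an Open event and `delete` for the matching Close event, with the $y$-extent of the expanded module as the interval.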

One has to be careful when constructing the segment tree. To find free space of height 0 and width 0,
we have to make sure that two modules starting or ending on the same coordinate are separated by an
elementary interval in the segment tree. This can be done by perturbing the top left corner of each module
by a sufficiently small $\varepsilon > 0$.

The crucial parts of our algorithm are two plane sweeps: one horizontal sweep that discovers all the
vertical contour segments and one vertical sweep that finds all horizontal segments. As the horizontal and
the vertical sweep differ only in the initialization, we only describe the horizontal sweep.

For the horizontal sweep we construct a list $L$ of $2n$ quadruples, denoted by $(p_i, t_i, b_i, e_i)$: for each of the
modules in $M'$, we add two elements to $L$, one for the left side $(x'_i, \mathrm{Open}, y'_i, y'_i + h'_i)$ and one for the right
side $(x'_i + w'_i, \mathrm{Close}, y'_i, y'_i + h'_i)$. This list is sorted lexicographically and we assume that Close < Open.
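
The construction of the event list can be sketched in a few lines of Python. The function name `build_events` and the integer encoding of the event types are assumptions of this sketch; encoding Close as 0 and Open as 1 makes Python's lexicographic tuple ordering realize the convention Close < Open.

```python
CLOSE, OPEN = 0, 1  # integer codes chosen so that Close < Open when sorting


def build_events(expanded_modules):
    """Return the sorted list of 2n events (p, t, b, e) for the horizontal sweep.

    expanded_modules: iterable of (x, y, w, h) tuples describing M'.
    """
    events = []
    for x, y, w, h in expanded_modules:
        events.append((x, OPEN, y, y + h))       # left side of the module
        events.append((x + w, CLOSE, y, y + h))  # right side of the module
    events.sort()  # lexicographic: position first, then event type
    return events
```

With two modules that touch at $x = 2$, e.g. `build_events([(0, 0, 2, 2), (2, 1, 3, 1)])`, the Close event of the first module is processed before the Open event of the second.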
Figure 5: (Left) A set of rectangles, representing expanded modules (dashed, light). (Center) The contour as
returned by CUR, and the resulting feasible space for placing the center point of the new module (dark, with
boundary). (Right) Correct feasible positions for placing the center point of the new module, as returned by
our FSM (dark, with boundary).

In the sweep we process all $2n$ elements of $L$. In case of an Open event, the corresponding contour points
are retrieved from the segment tree and the segment $[y'_i, y'_i + h'_i]$ is added to the tree. For a Close event, the
segment $[y'_i, y'_i + h'_i]$ is removed from the tree and the corresponding contour points are retrieved.

In the CUR algorithm we would construct the horizontal contour segments from the vertical segments.
In our setting we might not find all free space of height 0. So we need to do another vertical plane sweep to
discover all horizontal segments.



2.3 Combinatorial Complexity of the Free-Space Contour 

In this section we will show that the combinatorial complexity of the contour of the free space is linear in
the number of modules. We thereby show that the complexity of our algorithm is $O(n \log n)$.

In general, the contour of a union of $n$ rectangles may consist of $\Omega(n^2)$ line segments, e.g., when considering
two sets of $n/2$ pairwise overlapping axis-parallel strips, where the intersections form a grid pattern. This is
not the case for the sets of rectangles arising as expanded modules; in fact, we prove that an arrangement
of rectangles as in Figure 6 is impossible. As this is the only arrangement of two rectangles for which the
number of edges forming the contour exceeds eight, this can be used as a stepping stone for an inductive
proof that the contour never consists of more than $4n$ line segments.

Theorem 1 The expanded regions $m'_i, m'_j \in M'$ for two existing, disjoint modules $m_i, m_j \in M$ cannot cross.

Proof: Every expanded module $m'_i$ is a rectangle, described by four bounding coordinates: $x'_i$, the
position of its left edge; $x'_i + w'_i$, the position of its right edge; $y'_i$, the position of its lower edge; $y'_i + h'_i$, the
position of its upper edge. Now assume there are two expanded modules that cross, say, $m'_1$ and $m'_2$, and
assume without loss of generality that $x'_1 < x'_2$. Then crossing means that $x'_2 + w'_2 < x'_1 + w'_1$, $y'_2 < y'_1$, and
$y'_1 + h'_1 < y'_2 + h'_2$. But as all expanded modules arise from the original modules by moving their edges by
the same amount for each coordinate direction, the same relative order must be valid for the original edges.
This implies that the original modules cross, contradicting their disjointness. □

Theorem 2 FindContourSegments finds $O(n)$ contour segments.

Proof: We will argue that the number of segments of the contour is at most the number of segments
of all modules.

Figure 6: An impossible, crossing position for expanded modules.

Consider the frontier $O_i = \{[b_1, e_1], \ldots, [b_o, e_o]\}$ of all open segments just before encountering event
$L_i = (p_i, t_i, b_i, e_i)$.

Let us consider an Open event $L_i$. Processing this event yields at most $|O_i|$ new contour segments. This
is the case if and only if all elements $[b_j, e_j] \in O_i$ are completely contained in $[b_i, e_i]$ and do not overlap. But
because no two modules can cross, all Close events $L_{j'}$ must occur before the Close event $L_{i'}$. Consequently,
the closing segments $[b_{j'}, e_{j'}]$ do not contribute to the contour and in total the number of segments added is
at most one.

Now let us consider a Close event $L_i$. Arguing as above, we conclude that if $[b_i, e_i]$ is totally contained
in an element $[b_j, e_j] \in O_i$, the total number of segments added to the contour is at most one.

In all other events at most one segment is added to the contour. As there are not more than $2n$ events
in each coordinate direction, the number of contour segments parallel to each axis is at most $2n$. Thus, the
number of contour segments is at most $4n$, i.e., linear in $n$. □

2.4 Computational Complexity 

Theorem 3 The complexity of FindContourSegments is $O(n \log n)$.

Proof: The algorithm CUR has a running time of $O(n \log n + s)$, where $s$ is the number of segments
of the contour. As we have shown in Theorem 2, $s = O(n)$. Our modifications to CUR do not increase
the running time by more than a constant factor. Thus the running time of FindContourSegments is
$O(n \log n)$. □

2.5 Lower Bound 

Assuming a standard computational model, we can show that our FSM has optimal running time, by
providing a matching lower bound:

Theorem 4 In the algebraic tree model of computation, there is a lower bound of $\Omega(n \log n)$ on the complexity
of deciding the maximum size of a free rectangle between $n$ existing rectangles.

Proof: The claim is already true in one dimension, for $n$ existing unit intervals, with positions described
by $n$ integers $a_1, \ldots, a_n$. Determining a maximum free interval is precisely the problem Maximum Gap.

The Maximum Gap problem is the problem of determining the maximum gap between two consecutive
numbers of a set $A = \{a_1, a_2, \ldots, a_n\}$ of $n$ numbers. Two elements of $A$ are called consecutive if they appear
consecutively in the sequence obtained by sorting $A$. The running time of Maximum Gap is bounded from
below by $\Omega(n \log n)$, as described in Chapter 6 of [15]. □
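
The one-dimensional correspondence used in the proof can be made concrete with a short Python sketch. It shows the straightforward sorting-based way to compute the maximum gap and the resulting longest free interval between unit intervals; it only illustrates the problem and does not prove the lower bound. The function names and the convention that a unit interval at position $a$ occupies $[a, a+1]$ are assumptions of this sketch.

```python
def maximum_gap(values):
    """Largest difference between consecutive elements of the sorted input."""
    s = sorted(values)  # O(n log n), matching the lower bound
    return max((b - a for a, b in zip(s, s[1:])), default=0)


def max_free_interval(left_ends):
    """Length of the longest free interval between unit intervals [a, a + 1]."""
    # Two consecutive unit intervals at distance g leave a free gap of g - 1.
    return max(0, maximum_gap(left_ends) - 1)
```

For positions 7, 1, 4, 2 the sorted sequence is 1, 2, 4, 7, so the maximum gap is 3 and the longest free interval (between 5 and 7) has length 2.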



3 Routing-Conscious Placement

After describing how to find all feasible placements for a new module, we turn to finding an optimal placement,
such that the weighted communication cost is minimized. As described in the introduction, an appropriate
measure for this cost is the Manhattan distance between modules, weighted by the relative amount of
communication. This can still be achieved in time $O(n \log n)$, making use of local optimality properties, our
FSM, and another application of plane-sweep techniques.

3.1 Model 

Given $F'$ and $M'$ as in Section 2, the objective of the placer is to find a point in free space that minimizes
communication cost for the new module $m$.

Figure 7: (Left) Physical chip with one module and its connectivity to other modules drawn by dotted lines.
(Right) Dotted lines with bandwidth attribute $b_i^m$ to denote required connectivity of module $m$ to other modules
$i$.

In our routing-conscious approach we let the communication cost of an additional module $m$ depend on
the distance to the centers of a subset of existing modules, which is measured in the Manhattan metric. We
may also consider communication with the chip boundary as indicated in Figure 7. So we have to consider
the set $C = \{(x_1, y_1), \ldots, (x_k, y_k)\}$ of demand points for communication. Clearly, we may assume that the
number $k$ of connections to be established with module $m$ is linear in $n$. A second factor in communication
cost is the width $b_i^m$ of the communication path needed to create a routing unit between modules $i$ and $m$;
this needs to be taken into account as a multiplicative factor. Thus, we get the objective function

$$\min\Big\{ c(x'_m, y'_m) : (x'_m, y'_m) \in F' \setminus \bigcup_{m'_i \in M'} m'_i \Big\}$$

with

$$c(x, y) = \sum_{i=1}^{k} b_i^m \, \|(x_i, y_i) - (x, y)\|_1.$$

Because we are dealing with the Manhattan metric, this can be reformulated to

$$c(x, y) = c^x(x) + c^y(y)$$

with

$$c^x(x) = \sum_{i=1}^{k} b_i^m \, |x_i - x|$$

and

$$c^y(y) = \sum_{i=1}^{k} b_i^m \, |y_i - y|.$$

If we allowed $m'$ to be placed anywhere on the chip, this could be reformulated to

$$\min\Big\{c^x(x'_m) : x'_m \in \big[\tfrac{w_m}{2}, W - \tfrac{w_m}{2}\big]\Big\} + \min\Big\{c^y(y'_m) : y'_m \in \big[\tfrac{h_m}{2}, H - \tfrac{h_m}{2}\big]\Big\}.$$

As a consequence we can consider two separate minimization problems, one for each coordinate. If we ignore
feasibility, both minima are attained in the respective weighted medians, so they can be computed in linear
time [20]; as we already sort the coordinates for performing plane sweep, this running time is not critical, so
we may as well use a trivial method. Note that only medians satisfy unconstrained local optimality, as the
gradients for $c^x$ and $c^y$ are simply

$$\nabla c^x(x) = \sum_{x_i < x} b_i^m - \sum_{x_i > x} b_i^m,$$

the sum of the required bandwidths to the left minus the sum of the required bandwidths to the right, and

$$\nabla c^y(y) = \sum_{y_i < y} b_i^m - \sum_{y_i > y} b_i^m,$$

the sum of the required buswidths to the bottom minus the sum of the required buswidths to the top.
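
The "trivial method" for the unconstrained one-dimensional problem can be sketched as follows: sort the demand coordinates and return the first one at which the accumulated bandwidth reaches half of the total. The function names are hypothetical, and the sketch assumes a nonempty input with nonnegative bandwidths.

```python
def weighted_median(coords, weights):
    """Return a coordinate minimizing sum_i weights[i] * |coords[i] - x|."""
    pairs = sorted(zip(coords, weights))
    total = sum(weights)
    acc = 0
    for c, w in pairs:
        acc += w
        # The weight to the left (inclusive) now balances the weight to the right.
        if 2 * acc >= total:
            return c


def cost_1d(coords, weights, x):
    """One-dimensional communication cost at position x."""
    return sum(w * abs(c - x) for c, w in zip(coords, weights))
```

For demand coordinates 1, 5, 9 with bandwidths 1, 1, 3, the weighted median is 9 with cost 12, whereas positions 5 and 1 cost 16 and 28, respectively. Applying the same routine to the $y$-coordinates gives the second median.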

3.2 Local Optimality 

As we have seen in the previous subsection, the median is the globally optimal point if it is not in the occupied
space. If it is in the occupied space, there are only two other types of points where the global optimum could
be located.

All points of one type can be found by intersecting the contour of the occupied space with the median
axes $\ell_x = \{(x_{med}, y) : y \in [0, H]\}$ and $\ell_y = \{(x, y_{med}) : x \in [0, W]\}$. In these points one of the gradients
$\nabla c^x$ and $\nabla c^y$ vanishes. We cannot move in the direction of a better solution because that way is blocked by
either a vertical or a horizontal segment of the contour.

The other type of points are some of the vertices of the contour. These points are the intersections of
horizontal and vertical segments forming an interior angle of $\pi/2$ pointing in the direction of the median. In
these points neither of the gradients vanishes. Both of the directions indicated by the gradients are blocked
by contour segments.

By simply inspecting all the local optima one finds the global optimum. In the next subsection we 
describe how this can be done efficiently. 




Figure 8: All potentially optimal points are marked by circles. As the median is located in the occupied space 
one of the other points must be optimal. The arrows show the directions in which the solution value would 
improve. 

3.3 Algorithm for Global Optimality 

The FSM described in the previous section computes the contour of the occupied space in time $O(n \log n)$. A
simple algorithm that finds the optimal point to place a new module $m$ would compute the median and
check its feasibility; if the outcome is positive, we have found the optimum. Otherwise we need to check
communication cost for all other possible local minima, i.e., for every vertex of the contour and every
intersection point of the contour with one of the median axes. Let $L$ denote this set of points. Computing
communication cost for a single point takes time $O(n)$, so evaluating all objective values in a brute-force manner
would take $O(n^2)$ time. However, by means of two more plane sweeps, we can achieve a complexity of
$O(n \log n)$.
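
As a baseline, the brute-force evaluation mentioned above can be written in a few lines of Python. The function names and the representation of demand points as `(x, y, bandwidth)` triples are assumptions of this sketch; the candidate set is assumed to be given (contour vertices and intersections with the median axes).

```python
def manhattan_cost(point, demands):
    """Weighted Manhattan distance from point to all demand points (O(n) per point)."""
    x, y = point
    return sum(b * (abs(dx - x) + abs(dy - y)) for dx, dy, b in demands)


def best_candidate(candidates, demands):
    """Evaluate every candidate from scratch; quadratic time overall."""
    return min(candidates, key=lambda p: manhattan_cost(p, demands))
```

With demands `[(0, 0, 2), (10, 0, 1)]` and candidates `[(3, 1), (7, 1), (5, 4)]`, the costs are 16, 20, and 27, so `(3, 1)` is selected.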

For this purpose, we observe that communication cost for the $x$- and $y$-coordinates of the contour segments
can be computed separately; we then add the precomputed values for every point of $L$. The crucial step is to
use the fact that we only need to compute the communication cost for the leftmost $x$-coordinate and for the
bottommost $y$-coordinate; the other values can be obtained by doing appropriate fast updates during the
plane sweep. Below we give details of this step; for ease of presentation, we add the communication points
$C$ to $L$, i.e., $L' = L \cup C$, and only describe updates for $x$-coordinates.

First we sort $L'$ by increasing $x$-coordinate (in time $O(n \log n)$). Next we remove all points that are
located on the same $x$-coordinate in $O(n)$ time. With the convention that $b_i^m = 0$ if $(x_i, y_i) \in L \setminus C$, we
compute the required bandwidth to the left and to the right of each point $i$ by $l_0 = 0$, $l_i = l_{i-1} + b_i^m$, $r_{|L'|+1} = 0$,
and $r_i = r_{i+1} + b_i^m$. These values can be computed by a forward and a backward scan in $O(n)$ time. This
yields the following recursive formula

$$c_{i+1} = c_i + (l_i - r_{i+1})(x_{i+1} - x_i)$$

for computing communication cost for all $x$-coordinates in time linear in $n$.
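
The update scheme can be checked with a small Python sketch. It computes the cost at the leftmost coordinate directly and then applies the recurrence $c_{i+1} = c_i + (l_i - r_{i+1})(x_{i+1} - x_i)$. The function name is hypothetical; as an assumption of this sketch, points sharing an $x$-coordinate are merged by adding up their bandwidths, and candidate points that are not demand points enter with bandwidth 0. The input is assumed to be nonempty.

```python
def sweep_costs(xs, bandwidths):
    """Return {x: c^x(x)} for all distinct x-coordinates in xs."""
    merged = {}
    for x, b in zip(xs, bandwidths):
        merged[x] = merged.get(x, 0) + b        # merge points on the same coordinate
    coords = sorted(merged)
    w = [merged[x] for x in coords]
    k = len(coords)

    left = []                                   # left[i]: bandwidth at coords[0..i]
    acc = 0
    for b in w:
        acc += b
        left.append(acc)
    right = [0] * (k + 1)                       # right[i]: bandwidth at coords[i..k-1]
    for i in range(k - 1, -1, -1):
        right[i] = right[i + 1] + w[i]

    # Cost at the leftmost coordinate is computed from scratch ...
    costs = [sum(b * (x - coords[0]) for x, b in zip(coords, w))]
    # ... all others follow from the recurrence in constant time each.
    for i in range(k - 1):
        step = coords[i + 1] - coords[i]
        costs.append(costs[-1] + (left[i] - right[i + 1]) * step)
    return dict(zip(coords, costs))
```

For coordinates 1, 5, 9, 3 with bandwidths 1, 1, 3, 0 (the point at 3 being a contour candidate), the sweep yields the costs 28, 22, 16, and 12 at $x = 1, 3, 5, 9$, which agree with direct evaluation.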

As the lower bound from the previous section still applies, we get the following: 

Theorem 5 A feasible position with minimum communication cost can be computed in time O(nlogn). 
                    Running Time (ms)      Routing Cost             Rejection Rate
                    RCP    NAO    KFF      RCP     NAO     KFF      RCP     NAO     KFF
Uniform 5-10%       173    197    204      1403    3641    9522     0%      0%      0%
Uniform 10-15%      162    208    194      1747    5490    14311    0%      1%      0%
Uniform 15-20%      160    172    158      2044    7250    19791    2%      5%      1%
Uniform 20-25%      156    181    161      1987    7061    20159    10%     12%     9%
Uniform 5-25%       168    224    215      1721    6741    21347    5%      8%      5%
Increasing 5-25%    196    252    243      1931    6914    21910    8%      14%     6%
Decreasing 25-5%    175    232    228      611     2311    11712    0%      3%      4%
Average             170    209    200      1635    5630    16965    3.6%    6.1%    3.6%



Table 1: Experimental results for the different benchmark instances. Overall running time, average routing
cost for each module, and rejection rate are shown for the different algorithms. RCP denotes the algorithm
described in this paper, NAO refers to the algorithm as described in [5], and KFF is the algorithm KAMER
combined with First Fit as presented in [3].



4 Experimental Results 

Our algorithm is not only fast in theory, but also quite practical (as constants are small)
and easy to implement. Here we show some results of our implementation. See Table 1 for an overview.

We randomly generated two different kinds of benchmark instances. All of the instances describe a 
scenario in which 100 modules have to be placed on an initially empty chip of size 80 x 120. Each module 
stays on the chip until at least a certain number of new modules have been placed. Then it is removed from 
the chip. This number is different for each module and is randomly drawn from the interval [4, 100]. Each 
of the modules needs to communicate with the border of the chip and with all the modules located on the 
chip. The buswidth is drawn from the interval [0, 10].

The instances differ in module size and distribution of the sizes. We have generated instances where all 
modules have roughly the same size (5-10%, 10-15%, 15-20%, and 20-25% of the chip size). These instances 
are called uniform since the sizes are distributed uniformly. We also have created instances where module 
sizes vary from 5 to 25% of the chip size. Here we consider three different kinds of distributions - uniform, 
increasing, and decreasing. In the increasing and decreasing case the modules are sorted by size. 

Given these instances, we benchmarked a C++ implementation of our algorithm, compiled with g++ 3.2, against
the algorithms described in [5] and [3]. Shown in the first set of columns in the table is a comparison of
average running times for 100 modules for each instance in milliseconds on a 2.53 GHz Intel Pentium 4 PC
running under the Linux operating system. Remarkably, our algorithm on average has the fastest running
time, even though it computes much better solutions. This illustrates the superiority of a plane-sweep
approach. Clearly, the difference in running times will increase for even larger instances.

The second set of columns compares the average routing cost per module. Routing costs are measured 
according to the weighted Manhattan distance, which reflects the fact that routing on the chip is done in an 
axis-parallel manner. Note that in [5], placement is done according to a weighted Euclidean distance, and 
optimization is only done heuristically. As a consequence, the objective values are markedly higher. [3] does 
not take routing cost into account and places by some bin-packing like heuristic trying to minimize rejection 
rate. This may result in modules being placed all over the chip, regardless of communication cost. As a 
result, communication cost is one order of magnitude higher than for our method. 

As a matter of fairness, we give a third set of columns, comparing the average number of modules that 
had to be rejected due to lack of space on the chip, which is one of the objectives in [3]. Note that this 
rejection rate does not play any role during the course of our algorithm, nor is it considered in [5]. It is 
striking that nevertheless, the total number of rejected modules for our algorithm is precisely the same as 
for [3]. Again, our results dominate the ones for [5] by a clear margin. 
In summary, our algorithm is faster, better, and more robust against
rejection than the method described in [5]. It is also faster (except for the case Uniform 15-20%), much better,
and as robust against rejection as the approach described in [3].

5 Conclusion 

We have shown how to deal with two crucial issues supporting partial reconfigurability in reconfigurable 
computing. This raises hope of achieving further progress for even more complicated scenarios. One such 
aspect is to streamline our data structures and algorithms for repeated insertion or removal of modules. 
Some computational work can be saved, as sorting from scratch is no longer required. While this makes it 
relatively straightforward to lower the resulting complexity of dealing with $n$ changes to a total time of $O(n^2)$,
it remains an open challenge to decide whether a subquadratic complexity is possible, as no appropriate 
techniques for establishing quadratic lower bounds are known. However, it is conceivable that we may be 
facing a 3SUM-hard problem, which is the next best thing to an explicit lower bound. See [21] for techniques 
used for showing this for other geometric problems. 

An even more interesting challenge arises by venturing from "routing-conscious" placement to "routing- 
optimal" placement: When routing among existing modules, we may have to consider them as obstacles for 
our paths. Thus, distances are not straight-line Manhattan distances, but geodesic Manhattan distances, 
i.e., given by shortest paths among obstacles. We believe that this type of problem can still be dealt with 
efficiently by using the techniques of Mitchell [22] . This has been done in [23] for an application to a routing- 
optimal placement problem for a continuum of demand points. For dynamic routing requests at runtime, 
principles that have been investigated include Dynamic Network on Chip (DyNoC) [24] and Honeycomb [25]. 

As our algorithm considers placing one module at a time, it is an interesting problem to consider the 
more complex task of placing the full set of modules at once. This is considered in [9] for the scenario of all 
processors being of the same size; even without existing modules and uniform routing cost, this turns out to 
be a tough problem, as noted in [8] . We hope to provide results on this scenario for modules of differing size 
and non-uniform routing cost in the near future. 

Finally, it should be interesting to consider placement of modules as an online problem, where only 
limited information is available at each stage. Interesting scenarios require an appropriate modeling of the 
objective function considered, in particular for the tradeoff between computing cost, routing cost, and the 
cost of rejecting modules. 